Machine Learning: AllLife Bank Personal Loan Campaign

Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers with a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which customer segments to target.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [2]:
# import libraries for data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# import libraries for the decision tree and metric scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)

#import library to ignore warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset

In [3]:
# import drive from google colab and mount it
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
#load the csv data
data=pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')

Data Overview

  • Observations
  • Sanity checks
In [5]:
#display first 10 rows of dataset
data.head(10)
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
5 6 37 13 29 92121 4 0.4 2 155 0 0 0 1 0
6 7 53 27 72 91711 2 1.5 2 0 0 0 0 1 0
7 8 50 24 22 93943 1 0.3 3 0 0 0 0 0 1
8 9 35 10 81 90089 3 0.6 2 104 0 0 0 1 0
9 10 34 9 180 93023 1 8.9 3 0 1 0 0 0 0

There are 14 columns in the dataframe. Each row represents a customer's personal details and their banking details.

  • ID represents the customer ID
  • Age, Experience, Income represent the customer's age, work experience, and income
  • ZIPCode represents the location where the customer lives; Family represents the number of family members
  • CCAvg represents average credit card spending per month in thousands
  • Education represents the level of education. 1 - Undergraduate, 2 - Graduate, 3 - Professional
  • Mortgage represents the house mortgage value in thousands
  • Personal_Loan represents whether the customer accepted the loan in the last campaign. 0 - No, 1 - Yes. (This will be our target)
  • Securities_Account represents whether the customer has a securities account. 0 - No, 1 - Yes
  • CD_Account represents whether the customer has a Certificate of Deposit account. 0 - No, 1 - Yes
  • Online represents whether the customer uses internet banking. 0 - No, 1 - Yes
  • CreditCard represents whether the customer has a credit card from a different bank. 0 - No, 1 - Yes
In [6]:
#to get number of rows and columns
data.shape
Out[6]:
(5000, 14)

There are 5000 rows and 14 columns in the Dataframe

In [7]:
#copy dataframe and drop ID column
df=data.copy()
df.drop('ID',axis=1,inplace=True)

The dataframe is copied and the ID column is dropped since it has no predictive significance.

In [8]:
#getting info on dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal_Loan       5000 non-null   int64  
 9   Securities_Account  5000 non-null   int64  
 10  CD_Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB
  • All column values are numerical.
  • There are 5000 non-null observations in each column, which indicates there are no explicit missing values
  • Columns such as Education, Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard, though numeric, can be considered categorical since they represent levels or Yes/No values
  • There are many ZIPCode values. The first two digits represent the region where the customer lives, so we will have to work on this column. Though it looks numerical, it is better treated as a categorical object.
In [9]:
df.describe().T
Out[9]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0
  • Customers are from 23 to 67 years old
  • Experience ranges from -3 to 43 years. A minimum of -3 is invalid, since experience cannot be negative; these values should be checked
  • The average income is 74K dollars
  • Family size varies from 1 to 4 members
  • 50% of customers spend at most $1,500 per month on credit cards
  • At least 50% of customers do not have any mortgage
  • The other columns take values of 0 or 1, which confirms they are categorical
In [10]:
#check for null values
df.isna().sum()
Out[10]:
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

There are no null values

In [11]:
#check for duplicated values
df.duplicated().sum()
Out[11]:
0

There are no duplicate values

In [12]:
#convert zipcode to strings and take first 2 digits
df["ZIPCode"] = df["ZIPCode"].astype(str)
df["ZIPCode"] = df["ZIPCode"].str[0:2]
In [13]:
#check unique zipcode values
df["ZIPCode"].unique()
Out[13]:
array(['91', '90', '94', '92', '93', '95', '96'], dtype=object)

ZIPCode now has only 7 unique values and is of type object.

Exploratory Data Analysis

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [14]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [15]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [16]:
# selecting numerical columns
num_col = ['Age','Experience','Family','Income','CCAvg','Mortgage']
# plot histogram and boxplot for numerical columns
for column in num_col:
    histogram_boxplot(df, column, kde=True)
    plt.title(f'Univariate Analysis of {column}')
    plt.show()
  • Age is roughly uniformly distributed, with similar mean and median values
  • There are no outliers or skewness in Age and Experience
  • Single-member families are the most common among customers
  • Income is right-skewed, with outliers at high annual incomes; such extreme values are plausible. The mean income is 73K and the median is approximately 65K
  • Average credit card spending is right-skewed, with outliers beyond $4K. Mean spending is approximately 2K, higher than the median
  • Most customers have no mortgage. Non-zero mortgage values start around 90K, with the highest at 635K. Mortgage outliers start around 250K
  • All the outliers are valid, and decision trees are robust to outliers, so we will not treat them
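The outlier observations above can be quantified with the same 1.5 × IQR rule the boxplots use. A minimal sketch on a toy series; the helper name `iqr_outlier_count` is illustrative, not part of the notebook:

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside the 1.5*IQR whiskers (the boxplot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# toy example: one extreme value well beyond the upper whisker
print(iqr_outlier_count(pd.Series([1, 2, 3, 4, 100])))
```

Applied to the real frame, e.g. `iqr_outlier_count(df['Income'])`, this gives the exact number of boxplot outliers per column.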
In [17]:
#selecting categorical columns
cat_col = ["Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode"]
#plot bargraph for categorical columns
for column in cat_col:
    labeled_barplot(df, column)
    plt.show()
  • Most customers (2096) are undergraduates, while 1403 are graduates and 1501 are professionals
  • Around 10% of customers (480 out of 5000) took a loan from the last campaign
  • 522 customers have a securities account
  • 302 customers have a CD account
  • More customers use online banking than not: around 60% of customers use it
  • 3530 customers do not have another bank's credit card, while 1470 do
  • By zip code, most customers are from area 94 and the fewest from 96
In [18]:
#plot correlation heatmap for numerical columns
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
In [20]:
df.corr()
Out[20]:
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
Age 1.000000 0.994215 -0.055269 -0.046418 -0.052012 0.041334 -0.012539 -0.007726 -0.000436 0.008043 0.013702 0.007681
Experience 0.994215 1.000000 -0.046574 -0.052563 -0.050077 0.013152 -0.010582 -0.007413 -0.001232 0.010353 0.013898 0.008967
Income -0.055269 -0.046574 1.000000 -0.157501 0.645984 -0.187524 0.206806 0.502462 -0.002616 0.169738 0.014206 -0.002385
Family -0.046418 -0.052563 -0.157501 1.000000 -0.109275 0.064929 -0.020445 0.061367 0.019994 0.014110 0.010354 0.011588
CCAvg -0.052012 -0.050077 0.645984 -0.109275 1.000000 -0.136124 0.109905 0.366889 0.015086 0.136534 -0.003611 -0.006689
Education 0.041334 0.013152 -0.187524 0.064929 -0.136124 1.000000 -0.033327 0.136722 -0.010812 0.013934 -0.015004 -0.011014
Mortgage -0.012539 -0.010582 0.206806 -0.020445 0.109905 -0.033327 1.000000 0.142095 -0.005411 0.089311 -0.005995 -0.007231
Personal_Loan -0.007726 -0.007413 0.502462 0.061367 0.366889 0.136722 0.142095 1.000000 0.021954 0.316355 0.006278 0.002802
Securities_Account -0.000436 -0.001232 -0.002616 0.019994 0.015086 -0.010812 -0.005411 0.021954 1.000000 0.317034 0.012627 -0.015028
CD_Account 0.008043 0.010353 0.169738 0.014110 0.136534 0.013934 0.089311 0.316355 0.317034 1.000000 0.175880 0.278644
Online 0.013702 0.013898 0.014206 0.010354 -0.003611 -0.015004 -0.005995 0.006278 0.012627 0.175880 1.000000 0.004210
CreditCard 0.007681 0.008967 -0.002385 0.011588 -0.006689 -0.011014 -0.007231 0.002802 -0.015028 0.278644 0.004210 1.000000
  • Age and Experience are highly correlated, so we can drop either one
  • Income and average credit card spending are strongly correlated
  • Income and Mortgage are positively correlated
  • Income, average credit card spending, and CD_Account have a strong correlation with Personal_Loan
  • Correlation does not imply causation
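The ranking of correlations against the target can also be read off programmatically. A sketch using a toy frame standing in for `df`; the values below are made up for illustration:

```python
import pandas as pd

# toy frame standing in for df; values are made up for illustration
toy = pd.DataFrame({
    "Income": [40, 80, 120, 200],
    "CCAvg": [0.5, 1.5, 2.5, 6.0],
    "Online": [0, 1, 1, 0],
    "Personal_Loan": [0, 0, 1, 1],
})

# absolute correlation of every numeric column with the target, strongest first
ranked = (
    toy.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

On the real data, `df.corr(numeric_only=True)["Personal_Loan"]` sorted the same way reproduces the ordering observed in the matrix above.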
In [21]:
#plot pairplot for numerical columns
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
  • Age and Experience have a positive correlation
  • Income has a positive correlation with average credit card spending and with Mortgage
  • There are no other strong correlations
In [22]:
#plot barplot for categorical columns with Personal Loan
sns.catplot(data=df, x="Education",y= "Personal_Loan",kind='bar')
plt.title('Education vs Personal Loan')
plt.show()
sns.catplot(data=df, x="ZIPCode",y= "Personal_Loan",kind='bar')
plt.title('Zipcode vs Personal Loan')
plt.show()
sns.catplot(data=df, x="Securities_Account",y= "Personal_Loan",kind='bar')
plt.title('Securities Account vs Personal Loan')
plt.show()
sns.catplot(data=df, x="CD_Account",y= "Personal_Loan",kind='bar')
plt.title('CD Account vs Personal Loan')
plt.show()
sns.catplot(data=df, x="Online",y= "Personal_Loan",kind='bar')
plt.title('Online vs Personal Loan')
plt.show()
sns.catplot(data=df, x="CreditCard",y= "Personal_Loan",kind='bar')
plt.title('Credit Card vs Personal Loan')
plt.show()
  • The fewest loan acceptors are undergraduates, while professionals have the highest acceptance count
  • Interest in accepting a loan increases with higher education
  • Zip code 96 has the lowest mean acceptance of personal loans
  • Customers with a securities account or a CD account have higher personal-loan acceptance
  • Online banking and credit card ownership show no clear relation with personal-loan acceptance
In [24]:
#plot boxplots of numerical columns vs Personal_Loan (outliers hidden)
sns.boxplot(data=df, x="Personal_Loan",y= "Age",showfliers=False)
plt.title('Age vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Experience",showfliers=False)
plt.title('Experience vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Income",showfliers=False)
plt.title('Income vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "CCAvg",showfliers=False)
plt.title('Credit card usage vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Mortgage",showfliers=False)
plt.title('Mortgage vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Family",showfliers=False)
plt.title('Family vs Personal Loan')
plt.show()
  • Age and Experience have no impact on personal-loan acceptance
  • Customers who accepted a personal loan have a higher income range
  • Customers with high median credit card spending have higher personal-loan acceptance
  • Personal-loan acceptance is higher among customers with higher mortgages
  • Larger families accepted personal loans more often than smaller families
In [25]:
#plot histogram of Age for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Age');
sns.histplot(data=df[df['Personal_Loan']==1],x='Age');
In [26]:
#plot histogram of Experience for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Experience');
sns.histplot(data=df[df['Personal_Loan']==1],x='Experience');
In [27]:
#plot histogram of Income for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Income');
sns.histplot(data=df[df['Personal_Loan']==1],x='Income');
In [28]:
#plot histogram of Family for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Family');
sns.histplot(data=df[df['Personal_Loan']==1],x='Family');
In [29]:
#plot histogram of credit card spending for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='CCAvg');
sns.histplot(data=df[df['Personal_Loan']==1],x='CCAvg');
In [30]:
#plot histogram of Mortgage for customers with and without a personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Mortgage');
sns.histplot(data=df[df['Personal_Loan']==1],x='Mortgage');
  • The plots confirm Age and Experience have no impact on personal-loan acceptance
  • The plots confirm that higher income goes with higher personal-loan acceptance
  • Though acceptance rises slightly with family size, the difference is not significant
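These proportion comparisons can be made exact with a group-wise mean: for a 0/1 target, the mean per group is the acceptance rate. A sketch on toy data, since only the pattern is the point:

```python
import pandas as pd

# toy stand-in for df: acceptance rate by family size
toy = pd.DataFrame({
    "Family": [1, 1, 2, 3, 3, 4, 4, 4],
    "Personal_Loan": [0, 0, 0, 0, 1, 0, 1, 1],
})

# mean of a 0/1 target per group is exactly the acceptance rate
rates = toy.groupby("Family")["Personal_Loan"].mean()
print(rates)
```

On the real frame, `df.groupby("Family")["Personal_Loan"].mean()` quantifies the family-size effect directly.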

Data Preprocessing

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [31]:
#checking unique values of experience
df["Experience"].unique()
Out[31]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [32]:
# Treating the experience values
df["Experience"].replace(-1, 1, inplace=True)
df["Experience"].replace(-2, 2, inplace=True)
df["Experience"].replace(-3, 3, inplace=True)
In [33]:
#checking unique values of experience
df["Experience"].unique()
Out[33]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, 34,  0, 38, 40, 33,  4, 42, 43])
  • There are no missing values in the data
  • There are outliers in Income, credit card spending, and Mortgage; they are valid values, so no treatment is applied
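Since each negative Experience value was replaced by its positive counterpart, the three `replace` calls above can be written as a single vectorized step. A sketch on a toy series:

```python
import pandas as pd

# toy series covering the observed range of Experience,
# including the negative values seen in the data
exp = pd.Series([-3, -1, 0, 5, 43])
exp_fixed = exp.abs()  # same effect as replace(-1, 1), replace(-2, 2), replace(-3, 3)
print(exp_fixed.min())
```

In the notebook this would read `df["Experience"] = df["Experience"].abs()`, and it also covers any other negative value without listing each one.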

Model Building

Model Evaluation Criterion

  • Since the goal is to identify groups of potential customers who will accept the loan, a decision tree is a natural way to segment customers
  • The criterion is to select the model with the best recall: maximizing recall minimizes false negatives, i.e., potential customers the campaign would otherwise miss
  • A plain decision tree, pre-pruning, and post-pruning are the techniques to be used
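The pre-pruning step can be driven by the recall criterion using the already-imported `GridSearchCV` and `make_scorer`. A minimal sketch on synthetic, imbalanced data standing in for the bank dataset; the parameter grid is illustrative, not the one used later:

```python
# recall-driven pre-pruning with GridSearchCV on synthetic, imbalanced data
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]},
    scoring=make_scorer(recall_score),  # optimize recall to keep false negatives low
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

`grid.best_estimator_` is then the pre-pruned tree refit on the full training set with the recall-best hyperparameters.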

Model Building

In [34]:
#create dummy variables with drop_first to reduce the number of columns
df_coded=pd.get_dummies(df, columns=['Education','ZIPCode'],drop_first=True)
df_coded.head(10)
Out[34]:
Age Experience Income Family CCAvg Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Education_2 Education_3 ZIPCode_91 ZIPCode_92 ZIPCode_93 ZIPCode_94 ZIPCode_95 ZIPCode_96
0 25 1 49 4 1.6 0 0 1 0 0 0 0 0 1 0 0 0 0 0
1 45 19 34 3 1.5 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 39 15 11 1 1.0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 35 9 100 1 2.7 0 0 0 0 0 0 1 0 0 0 0 1 0 0
4 35 8 45 4 1.0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
5 37 13 29 4 0.4 155 0 0 0 1 0 1 0 0 1 0 0 0 0
6 53 27 72 2 1.5 0 0 0 0 1 0 1 0 1 0 0 0 0 0
7 50 24 22 1 0.3 0 0 0 0 0 1 0 1 0 0 1 0 0 0
8 35 10 81 3 0.6 104 0 0 0 1 0 1 0 0 0 0 0 0 0
9 34 9 180 1 8.9 0 1 0 0 0 0 0 1 0 0 1 0 0 0
In [35]:
#drop Experience column and split the data
# note: this uses df, not the dummy-encoded df_coded created above
x=df.drop(['Personal_Loan','Experience'],axis=1)
y=df['Personal_Loan']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
In [36]:
#build model using gini criteria
dtree=DecisionTreeClassifier(criterion='gini',random_state=1)
dtree.fit(x_train,y_train)
Out[36]:
DecisionTreeClassifier(random_state=1)
In [37]:
#check class distribution and split sizes
print('Target variable training distribution',y_train.value_counts())
print('Target variable testing distribution',y_test.value_counts())
print('Independent variable training distribution',x_train.shape[0])
print('Independent variable testing distribution',x_test.shape[0])
Target variable training distribution 0    3169
1     331
Name: Personal_Loan, dtype: int64
Target variable testing distribution 0    1351
1     149
Name: Personal_Loan, dtype: int64
Independent variable training distribution 3500
Independent variable testing distribution 1500
In [38]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance(model, independent, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    independent: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(independent)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_performance = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_performance
In [39]:
def make_confusion_matrix(model, independent, target, labels=[1, 0]):
    '''
    Plot a labeled confusion matrix for the model's predictions

    model: classifier used to predict the target
    independent: independent variables
    target: ground-truth labels
    '''
    y_predict = model.predict(independent)
    cm=confusion_matrix( target, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                  columns = [i for i in ['Predicted - No','Predicted - Yes']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [40]:
#create confusion matrix for training data
make_confusion_matrix(dtree, x_train, y_train)
In [41]:
#do model performance for training and test data; print them
print("Training data performance")
dtree_train_perf = model_performance(dtree, x_train, y_train)
print(dtree_train_perf)
print("Testing data performance")
dtree_test_perf = model_performance(dtree, x_test, y_test)
print(dtree_test_perf)
Training data performance
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Testing data performance
   Accuracy    Recall  Precision        F1
0      0.98  0.885906   0.910345  0.897959
In [42]:
#list out independent variables
features = list(x.columns)
print(features)
['Age', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [43]:
#plot decision tree
plt.figure(figsize=(20,30))
tree.plot_tree(dtree,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [44]:
#print decision tree
print(tree.export_text(dtree,feature_names=features,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [35.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |   |--- weights: [16.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  41.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 142.50
|   |   |   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Age >  54.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Mortgage >  142.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  55.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- Age <= 47.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  47.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- Education >  1.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- Age <= 51.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  51.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |--- Mortgage <= 316.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Mortgage >  316.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [45]:
# print importance of features in tree building
print(pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.401465
Income              0.308336
Family              0.164664
CCAvg               0.045263
Age                 0.044438
CD_Account          0.025711
Mortgage            0.009561
Online              0.000561
ZIPCode             0.000000
Securities_Account  0.000000
CreditCard          0.000000
In [46]:
#plot importance of feature in tree building
importances = dtree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
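As a sanity check on importance tables like the one above: scikit-learn's impurity-based `feature_importances_` are non-negative and normalized to sum to 1, so the values can be read directly as relative shares. A minimal sketch on synthetic data (not the bank dataset) illustrating this:

```python
# Sketch: impurity-based feature importances are non-negative and sum to 1.
# Uses a toy dataset from make_classification, not the AllLife data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X, y)

imp = clf.feature_importances_
# each entry is that feature's share of the total impurity reduction
```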

Model Performance Improvement

Pre Pruning

In [47]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(3,15),
              'min_samples_leaf': [2,3,5,7,9,10,12,15],
              'max_leaf_nodes' : [2, 3, 5, 10],
              'min_impurity_decrease': [0.001,0.01,0.1]
             }

# Use recall as the metric to compare parameter combinations
scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(x_train, y_train)
Out[47]:
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, min_samples_leaf=2,
                       random_state=1)
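Besides `best_estimator_`, the fitted grid object also exposes the winning parameter combination and its mean cross-validated score directly. A minimal sketch on synthetic data (not the bank dataset, and with a smaller illustrative grid):

```python
# Sketch: read off what GridSearchCV found, on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, recall_score

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7], "min_samples_leaf": [2, 5]},  # illustrative grid
    scoring=make_scorer(recall_score),
    cv=5,
)
grid.fit(X, y)

best_params = grid.best_params_  # winning combination
best_recall = grid.best_score_   # its mean cross-validated recall
```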
In [48]:
#Training data confusion matrix
make_confusion_matrix(estimator,x_train,y_train)
In [49]:
print("Training data prepruned model performance")
dtree_preprun_train_perf = model_performance(estimator, x_train, y_train)
print(dtree_preprun_train_perf)
print("Testing data prepruned model performance")
dtree_preprun_test_perf = model_performance(estimator, x_test, y_test)
print(dtree_preprun_test_perf)
Training data prepruned model performance
   Accuracy    Recall  Precision        F1
0  0.989714  0.927492   0.962382  0.944615
Testing data prepruned model performance
   Accuracy    Recall  Precision        F1
0  0.981333  0.879195   0.929078  0.903448
In [50]:
# prepruned tree
plt.figure(figsize=(20,30))
tree.plot_tree(estimator,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [51]:
#print prepruned tree
print(tree.export_text(estimator,feature_names=features,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |--- Education >  1.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [52]:
# print importances of feature of prepruned tree
print(pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.447999
Income              0.328713
Family              0.155711
CCAvg               0.042231
CD_Account          0.025345
Age                 0.000000
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
In [53]:
#plot importances of feature of prepruned tree
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Post Pruning

In [54]:
#define cost complexity pruning model
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
#get alpha and impurity for the dataset
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
Out[54]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000223 0.001114
2 0.000268 0.002188
3 0.000359 0.003263
4 0.000381 0.003644
5 0.000381 0.004025
6 0.000381 0.004406
7 0.000409 0.006042
8 0.000476 0.006519
9 0.000527 0.007046
10 0.000582 0.007628
11 0.000593 0.008813
12 0.000641 0.011379
13 0.000769 0.014456
14 0.000882 0.017985
15 0.001552 0.019536
16 0.002333 0.021869
17 0.003024 0.024893
18 0.003294 0.028187
19 0.006473 0.034659
20 0.023866 0.058525
21 0.056365 0.171255
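The path table above has a fixed structure: `cost_complexity_pruning_path` returns one effective alpha per pruning step, sorted ascending, and the total leaf impurity can only rise as more of the tree is pruned away. A minimal sketch verifying both properties on synthetic data (not the bank dataset):

```python
# Sketch: properties of the cost-complexity pruning path, on toy data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas, impurities = path.ccp_alphas, path.impurities

# alphas come back sorted ascending; impurity is nondecreasing along the path
assert np.all(np.diff(alphas) >= 0)
assert np.all(np.diff(impurities) >= 0)
```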
In [55]:
# plot impurity vs alpha for the training data
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
In [56]:
# fit a tree for each alpha; the largest alpha prunes the tree to a single node
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
In [57]:
#plot alpha vs depth of tree and alpha vs number of nodes
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [58]:
#get recall values for training and test data
recall_train = []
for clf in clfs:
    pred_train = clf.predict(x_train)
    recall_train.append(recall_score(y_train, pred_train))

recall_test = []
for clf in clfs:
    pred_test = clf.predict(x_test)
    recall_test.append(recall_score(y_test, pred_test))
In [59]:
#plot alpha vs recall
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [60]:
# pick the tree whose alpha gives the best test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
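The selection above condenses to: one tree per candidate alpha, held-out recall for each, keep the argmax. A self-contained sketch of the same loop on synthetic data (not the bank dataset):

```python
# Sketch: choose ccp_alpha by held-out recall, on a toy dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

# drop the last alpha, which collapses the tree to a single node
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(Xtr, ytr)
alphas = path.ccp_alphas[:-1]

trees = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(Xtr, ytr)
         for a in alphas]
recalls = [recall_score(yte, t.predict(Xte)) for t in trees]
best_alpha = alphas[int(np.argmax(recalls))]
```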
In [61]:
# build a model with the chosen alpha and fit it on the data
estimator_postprun = DecisionTreeClassifier(ccp_alpha=0.0006414, class_weight='balanced', random_state=1)
estimator_postprun.fit(x_train, y_train)
Out[61]:
DecisionTreeClassifier(ccp_alpha=0.0006414, class_weight='balanced',
                       random_state=1)
In [62]:
#confusion matrix for postpruned training data
make_confusion_matrix(estimator_postprun,x_train,y_train)
In [63]:
#get performance for postpruned training data
print('Performance of postpruned training data')
dtree_postprun_train_perf = model_performance(estimator_postprun, x_train, y_train)
print(dtree_postprun_train_perf)
print('Performance of postpruned testing data')
dtree_postprun_test_perf = model_performance(estimator_postprun, x_test, y_test)
print(dtree_postprun_test_perf)
Performance of postpruned training data
   Accuracy  Recall  Precision        F1
0  0.987714     1.0   0.885027  0.939007
Performance of postpruned testing data
   Accuracy    Recall  Precision        F1
0     0.972  0.912752   0.824242  0.866242
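Note that the post-pruned model above was also fit with `class_weight='balanced'`, which reweights each class inversely to its frequency (weight = n_samples / (n_classes * class_count)), pushing the tree to pay more attention to the minority loan-acceptor class and helping drive train recall to 1.0. A minimal sketch of the weights this produces, on a toy 9:1 imbalance rather than the bank data:

```python
# Sketch: what class_weight='balanced' computes, on a toy 9:1 imbalance.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 9 + [1])  # 9 negatives, 1 positive (illustrative)
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# weight per class = n_samples / (n_classes * class_count)
# class 0: 10 / (2 * 9),  class 1: 10 / (2 * 1)
```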
In [64]:
# postpruned tree
plt.figure(figsize=(20,30))
tree.plot_tree(estimator_postprun,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [65]:
# print postpruned tree
print(tree.export_text(estimator_postprun,feature_names=features,show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Mortgage <= 102.50
|   |   |   |   |   |--- Income <= 68.50
|   |   |   |   |   |   |--- weights: [8.28, 0.00] class: 0
|   |   |   |   |   |--- Income >  68.50
|   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |--- weights: [6.07, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- ZIPCode <= 94.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.73, 52.87] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode >  94.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.21, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [5.52, 0.00] class: 0
|   |   |   |   |--- Mortgage >  102.50
|   |   |   |   |   |--- weights: [11.60, 0.00] class: 0
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 103.50
|   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |--- weights: [22.09, 0.00] class: 0
|   |   |   |   |--- CCAvg >  3.21
|   |   |   |   |   |--- ZIPCode <= 92.00
|   |   |   |   |   |   |--- weights: [0.55, 15.86] class: 1
|   |   |   |   |   |--- ZIPCode >  92.00
|   |   |   |   |   |   |--- weights: [2.21, 0.00] class: 0
|   |   |   |--- Income >  103.50
|   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 108.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [6.07, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [1.10, 10.57] class: 1
|   |   |   |--- Income >  108.50
|   |   |   |   |--- weights: [1.66, 280.21] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.85
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [37.55, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |--- Age <= 27.50
|   |   |   |   |   |   |   |--- weights: [3.87, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  27.50
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- ZIPCode <= 93.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.87, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.31, 15.86] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode >  93.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.97, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [2.76, 21.15] class: 1
|   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |--- weights: [4.97, 0.00] class: 0
|   |   |   |--- CCAvg >  2.85
|   |   |   |   |--- weights: [6.63, 153.32] class: 1
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 1173.72] class: 1

In [66]:
#print importance of feature in postpruned tree
print(pd.DataFrame(estimator_postprun.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Income              0.646559
Family              0.151146
CCAvg               0.092357
Education           0.086624
CD_Account          0.007647
ZIPCode             0.006300
Mortgage            0.004871
Age                 0.004495
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
In [67]:
#plot importance of feature in postpruned tree
importances = estimator_postprun.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Model Comparison and Final Model Selection

In [68]:
# get the recall values from different models and compare
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with restricted maximum depth','Decision tree with post-pruning'],
                                 'Train_Recall':[1.0,0.927,1.0], 'Test_Recall':[0.88,0.879,0.913]})
comparison_frame
Out[68]:
Model Train_Recall Test_Recall
0 Initial decision tree model 1.000 0.880
1 Decision tree with restricted maximum depth 0.927 0.879
2 Decision tree with post-pruning 1.000 0.913

Since recall is highest for the post-pruned decision tree, the post-pruned model is selected.

Actionable Insights and Business Recommendations

  • What recommendations would you suggest to the bank?
  • Customers with higher education levels have a better chance of accepting a personal loan offer.
  • The next two factors to consider are income and family size.
  • The bank should consider offering education loans, which are beneficial and attract more customers.
  • The bank should revisit its loan interest rates and reduce them to encourage more customers to accept loan offers.
  • The bank should expand its operations into areas where it currently has fewer customers.
  • The bank should optimize its marketing strategies.
  • Establish partnerships with local businesses and help them grow with personalized loan offers.
  • Conduct customer surveys and act on the feedback.